DOCUMENT RESUME

ED 247 256                                                    TM 840 427

AUTHOR        Subkoviak, Michael J.; Harris, Deborah J.
TITLE         A Short-Cut Statistic for Item Analysis of Mastery
              Tests: A Comparison of Three Procedures.
PUB DATE      Apr 84
NOTE          21p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (68th, New
              Orleans, LA, April 23-27, 1984).
PUB TYPE      Speeches/Conference Papers (150) -- Reports -
              Research/Technical (143)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Comparative Analysis; Elementary Secondary Education;
              *Item Analysis; Latent Trait Theory; *Mastery Tests;
              Pretests Posttests; *Test Construction; Test Items
IDENTIFIERS   *Agreement Statistic (Subkoviak and Harris)

ABSTRACT

This study examined three statistical methods for selecting items for
mastery tests. One is the pretest-posttest method due to Cox and Vargas
(1966); it is computationally simple, but has a number of serious
limitations. The second is a latent trait method recommended by van der
Linden (1981); it is computationally complex, but has a number of
theoretical advantages. The third method, the agreement statistic proposed
in this paper, parallels the latent trait method in many respects; but it
is computationally simple, like the pretest-posttest procedure. A total of
81 distinct data sets were simulated; and the three item selection methods
were applied to each data set for the purpose of studying relationships
among the methods. The correlation between the latent trait method and the
agreement method was substantial, suggesting that the latter might be
recommended as a practical alternative to the former for classroom use.
The results for the pretest-posttest method tended to confirm its reputed
limitations. (Author/BS)



***********************************************************************
* Reproductions supplied by EDRS are the best that can be made        *
* from the original document.                                         *
***********************************************************************



A Short-Cut Statistic for Item Analysis of Mastery Tests:
A Comparison of Three Procedures

Michael J. Subkoviak and Deborah J. Harris
University of Wisconsin-Madison






Paper presented at the annual meeting of the American Educational Research
Association, New Orleans, April 1984.



Abstract

This study examined three statistical methods for selecting items for
mastery tests. One is the pretest-posttest method due to Cox and Vargas
(1966); it is computationally simple, but has a number of serious
limitations. The second is a latent trait method recommended by van der
Linden (1981); it is computationally complex, but has a number of theoretical
advantages. The third method, proposed herein, parallels the latent trait
method in many respects; but it is computationally simple, like the
pretest-posttest procedure. A total of eighty-one distinct data sets were
simulated; and the three item selection methods were applied to each data set
for the purpose of studying relationships among the methods. The correlation
between the latent trait method and the one proposed herein was substantial,
suggesting that the latter might be recommended as a practical alternative to
the former. The results for the pretest-posttest method tended to confirm its
reputed limitations.




Item Selection for Mastery Tests: A Comparison of Three Procedures

BACKGROUND AND PURPOSE

Mastery testing has become increasingly popular within the classroom
during the last decade, for at least two reasons. First is the increased use
of individualized educational programs (see, for example, Davis, 1983;
Hambleton & Novick, 1973; Huynh, 1976). Second is the movement to upgrade
education and to hold teachers and school systems accountable for that which
they claim to teach (see, for example, Bock, 1976).

Mastery testing presumes a series of well-defined tasks to be assessed
and a cutoff which distinguishes those who have successfully mastered the
aforementioned tasks and those who have not. According to Glaser (1963),
criterion-referenced test scores should maximize the difference between these
two groups and minimize differences within groups. For a mastery test, this
means selecting items that discriminate between masters and nonmasters, as
opposed to within masters and within nonmasters. The general consensus
appears to be that a good mastery test item is one which masters answer
correctly and nonmasters answer incorrectly (see, for example, Edwards, 1970;
Lord, 1980); and the index proposed herein is based on this concept. An
abundance of statistics have been proposed for selecting items for a mastery
test (Berk, 1980). Unfortunately, some of these statistics have conceptual
flaws, while others are too complex for routine classroom use. Thus, this
study proposes a simple item selection statistic for classroom use and
compares it to two other statistics that have appeared previously in the
literature (Cox & Vargas, 1966; van der Linden, 1981).



The Pretest-Posttest Statistic

In 1966, Cox and Vargas proposed a pretest-posttest statistic designed to
select criterion-referenced test items. The pretest-posttest statistic,
designated $D_{pp}$, is computed by subtracting the proportion of subjects who
correctly respond to an item on a pretest from the proportion of subjects who
correctly respond to the item on a posttest, thus entailing two test
administrations. Cox and Vargas demonstrated in their 1966 study that the
pretest-posttest statistic and a traditional norm-referenced statistic
produced sufficiently different results to advocate use of the former in
criterion-referenced testing.
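The computation just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `pretest_posttest_index` and the small response matrices are hypothetical, and responses are assumed to be scored 1 (correct) or 0 (incorrect).

```python
def pretest_posttest_index(pretest, posttest):
    """Return D_pp for each item.

    pretest, posttest: lists of examinee response vectors (1 = correct,
    0 = incorrect) over the same items. D_pp is the posttest proportion
    correct minus the pretest proportion correct.
    """
    n_items = len(pretest[0])

    def proportions(responses):
        # Proportion of examinees answering each item correctly.
        return [sum(row[j] for row in responses) / len(responses)
                for j in range(n_items)]

    p_pre = proportions(pretest)
    p_post = proportions(posttest)
    return [post - pre for pre, post in zip(p_pre, p_post)]


# Hypothetical data: 4 examinees, 3 items, two administrations.
pretest = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
posttest = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [1, 0, 0]]
d_pp = pretest_posttest_index(pretest, posttest)
```

In this made-up example the first item gains the most from pretest to posttest and would be selected first, while the third item shows a negative difference.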

Since its introduction in 1966, the pretest-posttest statistic has become
a prototype for selecting criterion-referenced test items. This is primarily
due to its simplicity both conceptually and computationally. Unfortunately,
the pretest-posttest statistic has some serious disadvantages, as several
authors have pointed out (e.g., Berk, 1980; van der Linden, 1981). These
limitations include, among others: the need for two test administrations;
problems related to administering the same item set twice; problems inherent
to change scores; population dependency; and lack of sensitivity to the power
of an item to discriminate at the cutoff score. Thus, while the
pretest-posttest statistic is conceptually easy to understand and is manually
calculable, the flaws embedded within it make its use as a method of item
selection questionable in the context of mastery testing.

The Latent Trait Statistic 

In 1981, van der Linden proposed a statistic which measures the power of
an item to discriminate at a given cutoff point and which was recommended as a
replacement for the pretest-posttest statistic as an item selection technique
for mastery tests. This latent trait statistic is based on the concept of an
item characteristic curve, which specifies the probability that an examinee
with ability $\theta$ will correctly respond to an item with given difficulty,
discrimination, and guessing parameters. In the usual case, the more ability
a subject has, the more probable it is that he or she will correctly respond
to the item. The item characteristic curve is most generally defined by:

$$P_i(+ \mid \theta) = c_i + (1 - c_i)\{1 + \exp[-a_i(\theta - b_i)]\}^{-1} \qquad (1)$$

where $P_i(+ \mid \theta)$ is the probability of responding correctly to item
$i$ given ability level $\theta$; $c_i$ is the item guessing parameter; $a_i$
is the item discrimination parameter; and $b_i$ is the item difficulty
parameter. A popular simplification of Equation (1), known as the Rasch
model, results by assuming $c_i = 0$ and $a_i = 1$ in (1):

$$P_i(+ \mid \theta) = \{1 + \exp[-(\theta - b_i)]\}^{-1} \qquad (2)$$
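Equations (1) and (2) translate directly into code. The sketch below is illustrative only; the function names `icc` and `rasch` are hypothetical, and no scaling constant is applied to the exponent because none appears in (1).

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Equation (1): probability of a correct response at ability theta.

    a = discrimination, b = difficulty, c = guessing parameter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rasch(theta, b):
    """Equation (2): the special case of (1) with c = 0 and a = 1."""
    return icc(theta, a=1.0, b=b, c=0.0)


# An examinee whose ability equals the item difficulty has a 50% chance
# of success under the Rasch model.
p_at_difficulty = rasch(1.0, b=1.0)
```

With a nonzero guessing parameter the curve has a lower asymptote of $c_i$, so the probability at $\theta = b_i$ rises to $c_i + (1 - c_i)/2$.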

A mastery test requires that a criterion or cutoff score be specified to
distinguish masters from nonmasters; and once this cutoff is selected, its
associated value $\theta_c$ on the ability scale being measured can be
determined (see van der Linden, 1981). Desirable items have characteristic
curves, given by (1) or (2), with steep slope at $\theta_c$, indicating that
masters have a much higher relative probability of responding correctly than
nonmasters. The slope is given by the derivative of the item characteristic
curve at $\theta_c$ and is designated $P_i'(+ \mid \theta_c)$. Desirable items
also have small scatter or variance of item responses at $\theta_c$. Since
item responses are dichotomous, the scatter at $\theta_c$ equals
$P_i(+ \mid \theta_c)[1 - P_i(+ \mid \theta_c)]$. van der Linden (1981)
proposed an item selection index, $I_i(\theta_c)$, which combines the slope
and scatter of an item at $\theta_c$ as follows:

$$I_i(\theta_c) = \frac{[P_i'(+ \mid \theta_c)]^2}{P_i(+ \mid \theta_c)[1 - P_i(+ \mid \theta_c)]} \qquad (3)$$

The index given by (3) is the value of the "item information function" at
$\theta_c$ (Birnbaum, 1968); and, roughly speaking, it is a measure of an
item's power to discriminate between masters and nonmasters. For the Rasch
model given by (2), index (3) reduces to the simple form:

$$I_i(\theta_c) = P_i(+ \mid \theta_c)[1 - P_i(+ \mid \theta_c)] \qquad (4)$$
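As a check on the reduction from (3) to (4), the following sketch evaluates the item information function as squared slope over scatter and compares it with $P(1 - P)$ for a Rasch item. It assumes the curve of Equation (1) with no scaling constant; the function names and parameter values are hypothetical.

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Item characteristic curve of Equation (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def icc_slope(theta, a=1.0, b=0.0, c=0.0):
    """Derivative of Equation (1) with respect to theta."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return (1.0 - c) * a * logistic * (1.0 - logistic)


def item_information(theta_c, a=1.0, b=0.0, c=0.0):
    """Squared slope divided by scatter, evaluated at the cutoff theta_c."""
    p = icc(theta_c, a, b, c)
    return icc_slope(theta_c, a, b, c) ** 2 / (p * (1.0 - p))


# Rasch item (a = 1, c = 0): the information equals P(1 - P).
theta_c, b = 1.0, 0.5
p = icc(theta_c, b=b)
rasch_information = item_information(theta_c, b=b)
simple_form = p * (1.0 - p)
```

For the Rasch model the slope itself equals $P(1 - P)$, so squaring it and dividing by the scatter leaves $P(1 - P)$; the information is therefore largest for items whose difficulty lies at the cutoff.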



van der Linden (1981) performed an empirical study to explore the
relationship between the pretest-posttest statistic $D_{pp}$ and the Rasch
statistic $I_i(\theta_c)$ given by (4). A physics unit was taught to 156 tenth
grade subjects, and a 25 item multiple choice test was administered as a
pretest and a posttest. $D_{pp}$ and $I_i(\theta_c)$ statistics were obtained
for each item, and the correlation between the two statistics was computed
across items. The correlation was .23 for one cutoff point considered and was
-.19 for another cutoff point. Thus, the basic conclusion was that the two
statistics would tend to select very different subsets of "desirable" items
from the item pool employed in the study.

In a more recent study (Harwell, 1983), items selected by the Rasch
statistic (4) were generally found to produce more reliable tests than those
composed of items selected by the pretest-posttest statistic, when the two
selection methods were compared across multiple data sets. Thus, the results
of this study provide additional evidence for preferring the latent trait
statistic to the pretest-posttest statistic for mastery test item selection.

Despite its apparent theoretical and empirical superiority, the latent
trait statistic is not without its own disadvantages. Its computation
requires the use of a computer and appropriate software; it requires a
background in latent trait theory to understand and interpret; it generally
requires relatively large sample sizes (van der Linden, 1981); and the special
case of the Rasch model makes some stringent assumptions that probably are not
met in practice (see, for example, Hambleton & Cook, 1977). Thus, while it is
to be preferred on theoretical and empirical grounds, it is impractical for
classroom use in certain respects.

The Agreement Statistic 

While the latent trait statistic appears to be the best available index
for selecting criterion-referenced test items in terms of conceptual
compatibility with the purpose of mastery testing, its computational and
conceptual complexity is not entirely suitable for classroom use. Conversely,
the pretest-posttest statistic is manually calculable, but is inappropriate
for reasons previously noted. Thus, another statistic, designated $P(X_c)$, is
proposed here as an item selection index, more aligned with the latent trait
statistic than is the pretest-posttest statistic, but still manually
calculable.

$P(X_c)$ is computed from Table 1, with mastery/nonmastery status
(determined by total test score) and item response (correct/incorrect) as
marginal categories.



Insert Table 1 about here



Specifically, $P(X_c)$ is defined as:

$$P(X_c) = \frac{a_{11} + a_{22}}{N} \qquad (5)$$

where $a_{11}$ is the number of masters passing the item, $a_{22}$ is the
number of nonmasters failing the item, and $N$ is the total number of
examinees. Thus, $P(X_c)$ can be interpreted as the probability of agreement
between outcomes on an item and outcomes on the total test (Goodman & Kruskal,
1954). Ideal items would have $P(X_c)$ values equal to one. The practical
lower bound of $P(X_c)$, when no relationship exists between mastery status and
item response, may be computed as:
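The expression for this lower bound is not legible in the source reproduction. The sketch below therefore rests on an assumption: it uses the usual chance-agreement quantity for a 2 x 2 table with independent margins, which may or may not be the authors' exact formula. It also computes $P(X_c)$ of (5). The function names, the rule that a master is an examinee scoring at or above the cutoff, and the data are all hypothetical.

```python
def agreement_table(total_scores, item_responses, cutoff_score):
    """Tally the four cells of Table 1 for one item.

    Assumption: an examinee is a master when the total score is at or
    above cutoff_score. Responses are 1 (correct) or 0 (incorrect).
    """
    a11 = a12 = a21 = a22 = 0
    for score, response in zip(total_scores, item_responses):
        master = score >= cutoff_score
        if response == 1 and master:
            a11 += 1      # master, correct
        elif response == 1:
            a12 += 1      # nonmaster, correct
        elif master:
            a21 += 1      # master, incorrect
        else:
            a22 += 1      # nonmaster, incorrect
    return a11, a12, a21, a22


def agreement_statistic(a11, a12, a21, a22):
    """Equation (5): proportion of examinees in the two agreement cells."""
    return (a11 + a22) / (a11 + a12 + a21 + a22)


def chance_agreement(a11, a12, a21, a22):
    """Expected agreement if mastery status and item response were
    independent (assumed form of the lower bound, not taken from the source)."""
    n = a11 + a12 + a21 + a22
    masters, nonmasters = a11 + a21, a12 + a22
    correct, incorrect = a11 + a12, a21 + a22
    return (masters * correct + nonmasters * incorrect) / n ** 2


# Hypothetical 10-item test, cutoff of 8 items correct, 8 examinees.
scores = [9, 8, 8, 5, 4, 3, 9, 2]
responses = [1, 1, 0, 0, 0, 1, 1, 0]
cells = agreement_table(scores, responses, cutoff_score=8)
p_xc = agreement_statistic(*cells)
p_chance = chance_agreement(*cells)
```

Here six of the eight examinees fall in the agreement cells, against an expected half under independence, so this item would rank well above a chance-level item.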



Purpose of the Study

For reasons noted previously (Harwell, 1983; van der Linden, 1981), the
latent trait index was viewed as the theoretically preferred statistic in the
present study. The basic purpose of the study was to determine whether final
test forms selected by the latent trait statistic (4) and by the agreement
coefficient (5) are sufficiently similar to advocate use of the latter in
classroom applications. Given the current popularity of the pretest-posttest
statistic, its relationship to indices (4) and (5) was also considered in the
study.



METHOD

To compare the three item selection statistics, a simulation study was
designed to examine a variety of test conditions. The computer program GENIRV
(Baker, 1982) was used to generate dichotomous item response data for tests of
30, 50, and 100 items "administered" as pretests and posttests to samples of
30, 60, and 120 subjects.

A two-parameter latent trait model, with $c_i = 0$ in (1), was used to
generate the data. Item difficulty parameters $b_i$ were randomly selected
from a uniform distribution with values ranging from -3.00 to +3.00; item
discrimination parameters $a_i$ were randomly selected from a uniform
distribution with values ranging from +.30 to +1.25. These parameter values
were kept constant across both pretest and posttest within each data set; but
the ability level of the examinee group increased from pretest to posttest, as
discussed next.

Subjects' abilities $\theta$ were randomly sampled from normal distributions
with values ranging from -3.00 to +3.00. Since the pretest-posttest statistic
is reliant on the difference between the pretest and posttest ability
distributions, mean differences of 1, 2, and 3 standard deviations between the
two distributions were simulated. More specifically, three levels of pretest
knowledge were simulated by sampling abilities from normal distributions
having respective means and standard deviations of 0 and 1; -1 and .75; and -2
and .50. The standard deviation was decreased as the pretest mean decreased
to simulate a "floor effect", as often occurs in real data. Posttest
abilities were then simulated by drawing random samples from a normal
distribution with mean 1 and standard deviation 1.
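The generating procedure can be approximated as follows. This is not GENIRV, whose internals are not described here; it is a sketch under stated assumptions: abilities are kept within -3.00 to +3.00 by rejection sampling, responses are Bernoulli draws from the two-parameter curve, and the seed, function names, and the 30-item, 60-examinee condition shown are arbitrary choices for illustration.

```python
import math
import random


def sample_ability(rng, mean, sd, low=-3.0, high=3.0):
    """Draw one ability from a normal distribution restricted to [low, high].

    Rejection sampling is an assumption about how the range was imposed.
    """
    while True:
        theta = rng.gauss(mean, sd)
        if low <= theta <= high:
            return theta


def simulate_responses(rng, abilities, a_params, b_params):
    """Generate a 0/1 response matrix under the two-parameter model (c = 0)."""
    data = []
    for theta in abilities:
        row = []
        for a, b in zip(a_params, b_params):
            p_correct = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p_correct else 0)
        data.append(row)
    return data


rng = random.Random(1984)  # arbitrary seed for reproducibility
n_items, n_examinees = 30, 60

# Item parameters are held constant across pretest and posttest.
b_params = [rng.uniform(-3.00, 3.00) for _ in range(n_items)]
a_params = [rng.uniform(0.30, 1.25) for _ in range(n_items)]

# Pretest group with mean -1 and SD .75 (a 2 SD mean difference);
# posttest group with mean 1 and SD 1.
pre_abilities = [sample_ability(rng, -1.0, 0.75) for _ in range(n_examinees)]
post_abilities = [sample_ability(rng, 1.0, 1.0) for _ in range(n_examinees)]

pretest = simulate_responses(rng, pre_abilities, a_params, b_params)
posttest = simulate_responses(rng, post_abilities, a_params, b_params)
```

Because the same item parameters are reused for both administrations, any pretest-posttest difference in proportion correct is driven entirely by the shift in the ability distribution.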

Crossing the three simulation factors (number of items, number of
examinees, and pretest-posttest mean differences) resulted in 3 x 3 x 3 = 27
different conditions; and each condition was independently replicated 3 times,
resulting in a total of 3 x 27 = 81 test data sets being generated. For
each test data set, cutoff scores were set at 75% and 85% of the test items
correct. For each data set and each cutoff, three statistics were then
computed: pretest-posttest, latent trait (4), and agreement (5).
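The design arithmetic can be laid out explicitly. The sketch below only enumerates the conditions named in the text; the variable names are hypothetical, and the pretest (mean, SD) pairs are those given in the preceding paragraph.

```python
from itertools import product

item_counts = (30, 50, 100)
examinee_counts = (30, 60, 120)
# Mean difference in SD units -> pretest (mean, SD); posttest is (1, 1).
pretest_distributions = {1: (0.0, 1.0), 2: (-1.0, 0.75), 3: (-2.0, 0.50)}
replications = 3
cutoff_proportions = (0.75, 0.85)

# 3 x 3 x 3 = 27 conditions.
conditions = list(product(item_counts, examinee_counts, pretest_distributions))

# Each condition replicated 3 times: 81 data sets.
data_sets = [(condition, rep)
             for condition in conditions
             for rep in range(1, replications + 1)]

# Every data set is analysed at both cutoffs.
analyses = [(data_set, cutoff)
            for data_set in data_sets
            for cutoff in cutoff_proportions]
```

Each of the 81 data sets is thus analysed twice, once per cutoff, and each analysis yields three statistics per item.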

Once these statistics were computed, the items within each data set were
ranked on the basis of each statistic according to the order in which they
would be selected for a mastery test. Due to the apparent superiority of the
latent trait statistic (Harwell, 1983; van der Linden, 1981), it was used as a
basis for comparison. Spearman rank order correlations were thus computed
between the ranks of the pretest-posttest and the latent trait statistic, and
also between the ranks of the agreement and the latent trait statistic. In
order to determine if a statistically significant difference existed between
these two correlations across the various conditions simulated, a split plot
ANOVA was performed (see Kirk, 1982).
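The ranking and correlation step can be illustrated with a small stdlib-only sketch. It assumes that larger values of each statistic are selected first and that tied values receive average ranks; the source does not say how ties were handled, and the function names and item values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank items so the largest value gets rank 1; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical statistics for five items.
latent_trait = [0.25, 0.20, 0.05, 0.10, 0.24]
agreement = [0.90, 0.80, 0.55, 0.60, 0.85]
pre_post = [0.10, 0.20, 0.60, 0.50, 0.15]
rho_agreement = spearman(latent_trait, agreement)
rho_pre_post = spearman(latent_trait, pre_post)
```

In this contrived example the agreement values order the items exactly as the latent trait values do, while the pretest-posttest values order them in reverse.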

While running the split plot analysis would establish the existence of a
significant difference in the way the pretest-posttest statistic and the
agreement statistic correlate with the theoretically preferred latent trait
statistic, additional analyses were required to determine if the agreement
statistic is a suitable substitute for the latent trait statistic. For
example, the agreement statistic might correlate more highly with the latent
trait statistic than the pretest-posttest statistic, yet still not correlate
highly enough to be an adequate substitute for the latent trait statistic.
Therefore, two additional analyses were performed to determine if the overlap
between the latent trait and agreement statistics was large enough to advocate
the use of the latter in classroom settings. First, the individual
correlations were examined to determine the degree of similarity between these
two statistics. The second supplemental analysis involved determining the
amount of overlap between item sets selected by the two statistics, when 50%
of the initial item pool was selected for a final test form.
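The overlap analysis amounts to comparing two top-half item sets. The following sketch assumes that the highest-valued items are selected and that ties are broken by item position, neither of which is specified in the source; the function names and values are hypothetical, and at least two items are required.

```python
def select_top_half(values):
    """Return the indices of the best 50% of items (largest values first).

    Ties are broken by item position because Python's sort is stable;
    this tie rule is an assumption.
    """
    n_select = len(values) // 2
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:n_select])


def overlap_proportion(values_a, values_b):
    """Proportion of the selected items that the two statistics share."""
    chosen_a = select_top_half(values_a)
    chosen_b = select_top_half(values_b)
    return len(chosen_a & chosen_b) / len(chosen_a)


# Hypothetical statistics for a six-item pool.
latent_trait = [0.25, 0.20, 0.05, 0.10, 0.24, 0.02]
agreement = [0.90, 0.60, 0.70, 0.65, 0.88, 0.50]
overlap = overlap_proportion(latent_trait, agreement)
```

Here each statistic selects three items and two of them coincide, giving an overlap of two thirds.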

RESULTS

The cell means and standard deviations of correlation values across 3
replications of the various test conditions are presented in Table 2. In the
split plot analysis of the data, the effect for correlation-by-method was
significant at the $\alpha$ = .05 level; and $\eta^2$ for this effect was equal
to .87, meaning 87% of the variance in the data was explained by the two item
selection methods that were correlated with the latent trait method. It can
thus be concluded that a significant difference does exist in the way the
agreement statistic and pretest-posttest statistic correlate with the latent
trait statistic. In all cases, the mean correlation between the latent trait
statistic and agreement statistic exceeded the correlation of the latent trait
statistic and the pretest-posttest statistic in Table 2.



Insert Table 2 about here



With 87% of the variance accounted for by the primary effect of interest
in the ANOVA, relatively little variance was left to be accounted for by the
other effects and interactions. While some of these effects, such as those
due to number of items and to number of examinees, were statistically
significant, the associated variance accounted for by these effects was of
little practical significance. The reader interested in these secondary
details is referred to Harris (1983).

An examination of the individual correlations summarized in Table 2
revealed that the average correlation between the latent trait statistic and
the agreement statistic was .91, suggesting that the latter may generally be a
reasonable substitute for the former. In contrast, the average correlation
between the latent trait statistic and the pretest-posttest statistic was
-.17. An interesting finding in the data was that a majority of the
correlations between the pretest-posttest statistic and the latent trait
statistic were negative, meaning that the items the latent trait statistic
tended to select first are items the pretest-posttest statistic tended to
select last. A tentative explanation for this result follows.

Recall that the pretest-posttest statistic is computed by subtracting the
proportion of examinees who respond correctly to an item on the pretest from
the proportion who respond correctly on the posttest. As such, difficult
posttest items tend to be selected last by this method, because the
pretest-posttest difference tends to be small. Conversely, difficult posttest
items tend to be selected first by the latent trait method, because the
cutoffs in the present study correspond to high ability levels. Thus an item
which discriminates well at $\theta_c$ (which is the basis by which the latent
trait statistic selects items) would tend to be a difficult posttest item in
the present study. Therefore, difficult posttest items would tend to be
selected first by the latent trait statistic and last by the pretest-posttest
statistic.

The final analysis involved computing the proportion of overlap between 
item sets selected by the latent trait and agreement statistics, when 50% of 
the items in the initial item pool were selected for a final test form. The 
results are summarized in Table 3. The average proportion of overlap across 
all conditions in Table 3 was .94. 



Insert Table 3 about here 




The results of this analysis, coupled with the preceding examination of
correlation values in Table 2, suggest that the agreement statistic and the
latent trait statistic perform similarly enough for the former, due to its
conceptual and computational simplicity, to be tentatively recommended for
further study and for possible classroom use.

DISCUSSION

The purpose of this study was to compare three methods of item selection
which might be considered for mastery tests. The pretest-posttest statistic
is computationally and conceptually simple, but also has serious
limitations. The latent trait statistic is a desirable item selection method
both in its theoretical alignment with the purpose of mastery tests
(distinguishing masters from nonmasters at the cutoff score) and in yielding
more reliable tests than those constructed by the pretest-posttest statistic
(Harwell, 1983); however, the latent trait statistic is complex. Thus, the
agreement statistic was proposed as a possible alternative for classroom use.

Items selected by the agreement statistic were found to correlate highly
with items selected by the latent trait statistic; the mean correlation,
across all simulation conditions, was .91; and the average proportion of
overlap between selected item sets was .94. The item characteristic curve in
Figure 1 provides a basis for understanding the close relationship between the
latent trait statistic and the agreement statistic (Baker, personal
communication). With ability $\theta$ on the horizontal axis and the
probability of correct response given ability $\theta$ on the vertical axis, an
item characteristic curve is plotted. The cutoff score $\theta_c$ divides the
ability scale into masters (M) and nonmasters (NM).






The item characteristic curve is divided into four areas I-IV, which may
be identified with the four cells in Table 1. In Figure 1, the area below the
item characteristic curve is viewed as corresponding to those examinees
correctly responding to the item. Thus, area I corresponds to those
examinees who are masters (have ability levels to the right of $\theta_c$) and
who correctly respond to the item. In Table 1, $a_{11}$ is likewise the number
of examinees who are classified as masters in terms of their total test score
and who respond correctly to the item. Similarly, it can be seen that II
corresponds to $a_{12}$, III to $a_{21}$, and IV to $a_{22}$. In viewing the
item characteristic curve in Figure 1 it may also be noted that the steeper the
slope at $\theta_c$, the better the item discriminates between masters and
nonmasters, and the larger the areas I and IV become. Since the latent trait
statistic selects items on the magnitude of the information function at
$\theta_c$ (see 3 or 4) and the agreement statistic selects items on the
magnitude of $(a_{11} + a_{22})/N$, it can be seen from the item characteristic
curve that the two statistics will be selecting much the same items.



Insert Figure 1 about here 






REFERENCES

Baker, F. B. (1982). GENIRV: A program to generate item response vectors.
    Unpublished manuscript, University of Wisconsin-Madison.

Baker, F. B. (1983, August). Personal communication.

Berk, R. A. (1980). Item analysis. In R. A. Berk (Ed.), Criterion-referenced
    measurement: The state of the art. Baltimore, MD: Johns Hopkins
    University Press.

Birnbaum, A. (1968). Some latent trait models and their uses in inferring an
    examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories
    of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D. (1976). Basic issues in the measurement of change. In D. M.
    De Gruijter & L. J. Van der Kamp (Eds.), Advances in psychological and
    educational measurement. New York: Wiley.

Cox, R. C., & Vargas, J. S. (1966, February). A comparison of item selection
    techniques for norm-referenced and criterion-referenced tests. Paper
    presented at the annual meeting of the National Council on Measurement in
    Education, Chicago.

Davis, G. A. (1983). Educational psychology. Reading, MA: Addison-Wesley.

Edwards, A. L. (1970). The measurement of personality traits by scales and
    inventories. New York: Holt, Rinehart and Winston.

Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross
    classifications. Journal of the American Statistical Association, 49,
    732-764.

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in
    the analysis of educational test data. Journal of Educational
    Measurement, 14, 75-96.

Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory
    and method for criterion-referenced tests. Journal of Educational
    Measurement, 10, 159-170.

Harris, D. J. (1983). Item selection for mastery tests: A comparison of
    three procedures. Unpublished doctoral dissertation, University of
    Wisconsin-Madison.

Harwell, M. R. (1983). A comparison of two item selection procedures in
    criterion-referenced measurement. Unpublished doctoral dissertation,
    University of Wisconsin-Madison.

Huynh, H. (1976). Statistical consideration of mastery scores.
    Psychometrika, 41, 65-78.

Kirk, R. (1982). Experimental design. Belmont, CA: Brooks/Cole.

Lord, F. M. (1980). Applications of item response theory to practical
    testing problems. Hillsdale, NJ: Erlbaum.

van der Linden, W. J. (1981). A latent trait look at pretest-posttest
    validation of criterion-referenced test items. Review of Educational
    Research, 51, 379-402.



Table 1

Contingency Table for Computing $P(X_c)$

Item              master        nonmaster

correct           $a_{11}$      $a_{12}$

incorrect         $a_{21}$      $a_{22}$



Table 2

Means and Standard Deviations of Correlation Values

                                      75% cutoff                     85% cutoff
                     Mean        Agreement &    Pre-Post &      Agreement &    Pre-Post &
Items   Examinees    Difference  Latent Trait   Latent Trait    Latent Trait   Latent Trait

  30       30           1         .59(.52)       .05(.46)        .94(.07)      -.08(.43)
                        2         .83(.12)      -.20(.38)        .94(.08)      -.30(.33)
                        3         .87(.07)      -.45(.31)        .98(.01)      -.63(.26)
           60           1         .81(.09)       .30(.22)        .98(.02)       .15(.09)
                        2         .77(.08)       .06(.27)        .98(.01)      -.11(.22)
                        3         .90(.01)      -.64(.07)        .98(.01)      -.73(.10)
          120           1         .93(.00)       .16(.28)        .99(.00)       .03(.25)
                        2         .90(.02)      -.03(.07)        .99(.01)      -.22(.08)
                        3         .88(.06)      -.45(.25)        .99(.00)      -.35(.59)

  50       30           1         .76(.19)       .15(.12)        .98(.01)       .15(.07)
                        2         .90(.02)      -.13(.06)        .95(.05)      -.11(.16)
                        3         .87(.07)      -.45(.31)        .98(.01)      -.63(.26)
           60           1         .9?(.03)       .27(.08)        .97(.01)       .13(.05)
                        2         .86(.07)       .08(.13)        .98(.01)      -.07(.07)
                        3         .90(.01)      -.64(.07)        .98(.01)      -.73(.10)
          120           1         .91(.05)       .51(.23)        .97(.03)       .36(.?8)
                        2         .85(.03)       .28(.23)        .98(.01)       .03(.23)
                        3         .88(.06)      -.45(.25)        .99(.00)      -.35(.59)

 100       30           1         .74(.13)       .02(.07)        .99(.01)      -.32(.50)
                        2         .76(.13)      -.33(.27)        .98(.02)      -.43(.23)
                        3         .81(.07)      -.54(.11)        .99(.00)      -.66(.07)
           60           1         .81(.10)       .37(.13)        .92(.06)      -.02(.34)
                        2         .91(.01)      -.12(.50)        .99(.00)      -.25(.47)
                        3         .90(.01)      -.34(.19)        .99(.00)      -.46(.21)
          120           1         .92(.0?)       .24(.09)        .99(.00)       .13(.10)
                        2         .91(.01)      -.25(.37)        .99(.00)      -.27(.39)
                        3         .87(.04)      -.38(.06)        .97(.05)      -.52(.05)

Note. Means and standard deviations (in parentheses) were computed over three
replications of each condition. A "?" marks a digit that is illegible in the
source reproduction.



Table 3

Means and Standard Deviations: Proportion of Overlap
for Items Selected by Latent Trait and Agreement Statistics

                     Mean
Items   Examinees    Difference    75% cutoff     85% cutoff

  30       30           1           .73(.35)       .91(.10)
                        2           [illegible]    .9?(.04)
                        3           .91(.10)       .95(.04)
           60           1           .95(.04)       .95(.04)
                        2          1.00(.00)       .98(.04)
                        3           .98(.04)      1.00(.00)
          120           1           .84(.08)       .98(.04)
                        2           .9?(.04)       .98(.04)
                        3           .93(.00)       .98(.04)

  50       30           1           .88(.00)       .96(.04)
                        2           [illegible]    [illegible]
                        3           .91(.02)       .92(.07)
           60           1           .93(.05)       .96(.00)
                        2           [illegible]    [illegible]
                        3           .95(.06)       .99(.02)
          120           1           .97(.02)       .96(.04)
                        2           .93(.05)       .97(.02)
                        3           .91(.10)       .97(.02)

 100       30           1           [illegible]    [illegible]
                        2           .87(.11)       .9?(.03)
                        3           .93(.02)       .98(.02)
           60           1           .95(.03)       .98(.02)
                        2           .93(.06)       .98(.01)
                        3           .96(.02)       .97(.04)
          120           1           .95(.02)       .97(.01)
                        2           .91(.06)       .96(.02)
                        3           .98(.00)       .97(.01)

Note. Means and standard deviations (in parentheses) were computed over three
replications of each condition. A "?" marks a digit, and "[illegible]" an
entry, that cannot be read in the source reproduction.











